
TABLE 5.2
Quantization results for BERT-base on SST-2. Results are obtained with 128 groups in each layer.

Method        w-bits    e-bits   Acc     Size    Size-w/o-e
Baseline      32        32       93.00   415.4   324.5
Q-BERT        8         8        92.88   103.9   81.2
DirectQ       4         8        85.67   63.4    40.6
Q-BERT        4         8        92.66   63.4    40.6
DirectQ       3         8        82.86   53.2    30.5
Q-BERT        3         8        92.54   53.2    30.5
Q-BERT(MP)    2/4(MP)   8        92.55   53.2    30.5
DirectQ       2         8        80.62   43.1    20.4
Q-BERT        2         8        84.63   43.1    20.4
Q-BERT(MP)    2/3(MP)   8        92.08   48.1    25.4

Note: The quantization bit-width used for weights is abbreviated as "w-bits," that of the embedding as "e-bits," model size in MB as "Size," and model size without the embedding layer in MB as "Size-w/o-e." For simplicity and efficacy, all models except the Baseline use 8-bit activations. Here "MP" refers to mixed-precision quantization.

(number of heads) value matrices Wv are concatenated together, resulting in a 3-d tensor. For layer-wise quantization, as shown in Fig. 5.6(a), the entire 3-d tensor is quantized into the same range of discrete numbers. A special case of group-wise quantization treats each dense matrix as a group, so that every matrix can have its own quantization range, as shown in Fig. 5.6(b). The more general case in Fig. 5.6(c) further partitions each dense matrix with respect to its output neurons, and every contiguous block of d/(2Nh) output neurons is bucketed as a group.
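As an illustration of these three granularities (a sketch only, not the Q-BERT implementation), the NumPy snippet below quantizes a concatenated value-weight tensor layer-wise, matrix-wise, and group-wise. The shapes, the 4-bit setting, and the helper symmetric_quantize are assumptions made for the example.

```python
import numpy as np

def symmetric_quantize(x, num_bits=4):
    """Uniform symmetric quantization of x using a single clipping range."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12      # one scale for the whole block
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                            # dequantized values, for error measurement

# Hypothetical shapes: Nh heads, each value matrix Wv of shape (d, d/Nh).
d, Nh, bits = 768, 12, 4
Wv = np.random.randn(Nh, d, d // Nh).astype(np.float32)  # concatenated 3-d tensor

# (a) Layer-wise: the entire 3-d tensor shares one quantization range.
layer_wise = symmetric_quantize(Wv, bits)

# (b) Matrix-wise: each head's dense matrix gets its own range.
matrix_wise = np.stack([symmetric_quantize(Wv[h], bits) for h in range(Nh)])

# (c) Group-wise: each matrix is further split along the output-neuron axis
#     into buckets of `group` contiguous neurons, each with its own range.
group = (d // Nh) // 2          # e.g. d/(2*Nh) output neurons per group
group_wise = np.stack([
    np.concatenate([symmetric_quantize(Wv[h][:, g:g + group], bits)
                    for g in range(0, d // Nh, group)], axis=1)
    for h in range(Nh)
])

for name, w in [("layer-wise", layer_wise), ("matrix-wise", matrix_wise), ("group-wise", group_wise)]:
    print(name, "MSE:", float(np.mean((Wv - w) ** 2)))
```

Because every group uses a clipping range fitted to its own values, the reconstruction error typically shrinks as the granularity moves from layer-wise to matrix-wise to group-wise, at the cost of storing more scale factors.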

The results of Q-BERT on the development set of SST-2 are presented in Table 5.2. SST-2 is a movie-review dataset with binary annotations, where the label indicates a positive or negative review. It can be seen that Q-BERT outperforms direct quantization (DirectQ) by a large margin across various bit precisions.
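As a rough sanity check (not part of the source), the size columns in Table 5.2 are consistent with the parameter counts implied by the 32-bit baseline row; the short sketch below is a back-of-the-envelope calculation with those inferred counts.

```python
# Back-of-the-envelope check of the "Size" columns in Table 5.2, using parameter
# counts inferred from the 32-bit baseline row (activations are not stored).
encoder_params = 324.5e6 / 4                  # Size-w/o-e at 32 bits -> ~81.1M weights
embedding_params = (415.4e6 - 324.5e6) / 4    # remaining ~22.7M embedding parameters

def model_size_mb(w_bits, e_bits):
    """Total model size in MB for the given weight and embedding bit-widths."""
    return (encoder_params * w_bits + embedding_params * e_bits) / 8 / 1e6

print(round(model_size_mb(4, 8), 1))   # ~63 MB, matching the 4-bit rows
print(round(model_size_mb(2, 8), 1))   # ~43 MB, matching the 2-bit rows
```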

5.4 I-BERT: Integer-Only BERT Quantization

Kim et al. [118] propose I-BERT to construct an integer-only BERT. Their motivation comes from the fact that previous quantization schemes for transformer-based language models use simulated quantization (fake quantization), where all or part of the operations in inference (e.g., GELU, Softmax, and Layer Normalization) are carried out with floating-point arithmetic. Such approaches are illustrated on the left side of Fig. 5.4.
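To make the term concrete, the following sketch shows what simulated (fake) quantization looks like for a single linear layer followed by GELU; the quantize-dequantize helper and the layer shapes are illustrative assumptions. Note that the matrix multiplication, the bias addition, and the GELU itself still execute in floating point.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize-then-dequantize: values become discrete but are stored as floats."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

def gelu(x):
    # Floating-point GELU (tanh approximation) -- exactly the kind of operation
    # that simulated quantization leaves in floating point.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

# Hypothetical layer: activations and weights are fake-quantized, but the
# matmul, bias add, and GELU are all executed with float32 arithmetic.
x = np.random.randn(4, 768).astype(np.float32)
W = np.random.randn(768, 768).astype(np.float32)
b = np.zeros(768, dtype=np.float32)

y = gelu(fake_quantize(x) @ fake_quantize(W, num_bits=4) + b)  # float path throughout
```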

However, such approaches are hard to deploy in real edge-application scenarios, where many neural accelerators and popular edge processors do not support floating-point arithmetic. To address these challenges, an integer-only quantization scheme for BERT is necessary.
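For contrast, here is a generic sketch of how a quantized linear layer can be evaluated with integer arithmetic only. This is the standard fixed-point requantization trick rather than I-BERT's specific kernels: the floating-point rescaling factor is folded offline into an integer multiplier and a power-of-two division, so inference itself needs no floating-point operations; the scales and shapes below are assumptions for illustration.

```python
import numpy as np

def quantize_to_int8(x):
    """Offline: compute a per-tensor scale and the int8 representation of x."""
    scale = np.abs(x).max() / 127 + 1e-12
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

# Offline preparation (floating point is still allowed at this stage).
x_fp = np.random.randn(4, 768).astype(np.float32)
w_fp = np.random.randn(768, 768).astype(np.float32)
x_q, s_x = quantize_to_int8(x_fp)
w_q, s_w = quantize_to_int8(w_fp)
s_y = 1.0                                        # assumed (calibrated) output scale
shift = 31                                       # fold s_x * s_w / s_y into fixed point
multiplier = int(round(s_x * s_w / s_y * (1 << shift)))

# Online inference: integer arithmetic only.
acc = x_q.astype(np.int32) @ w_q.astype(np.int32)                 # int32 accumulation
y_q = np.clip(acc.astype(np.int64) * multiplier // (1 << shift),  # fixed-point rescale
              -128, 127).astype(np.int8)

# y_q * s_y approximates x_fp @ w_fp with no float operation at inference time.
```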

Specifically, the proposed I-BERT incorporates a series of novel integer-only quantization schemes for transformer-based language models, including new kernels for the efficient and accurate integer-only